This is a document detailing how to process mapped NGS reads (sorted and indexed .bam files), and derive the SFS, site-wise estimates of Theta and Tajima’s D using ANGSD. This is based on the wiki entry here. At the moment this is estiamted based on several teosinte samples from the HapMap2 project. These are stored on Farm in:

The examples TIL*.bam are all teosinte, but TIL8 and TIL25 are Z. mays mexicana and will be removed from future analysis. The remiaing TIL*.bam accessions are all Zea mays parviglumis and can be analysed together.

In order to process the data the order of opporations should be roughly:

  1. Estiamte the SFS.
  2. Estimate Thetas
  3. Measure Statistics of deviation from neutrality.

Below is an example of work flow given in the wiki, enabling the eventual estimation of Tajima’s D.


    ./angsd -bam bam.filelist -doSaf 1 -anc chimpHg19.fa -GL 2 -P 24 -out out 
    ./misc/realSFS out.saf 20 -P 24 > out.sfs
    ./angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 2
    ./misc/thetaStat make_bed out.thetas.gz
    #Estimate for every Chromosome/scaffold
    ./misc/thetaStat do_stat out.thetas.gz -nChr 20
    #Do a sliding window analysis based on the output from the make_bed command.
    ./misc/thetaStat do_stat out.thetas.gz -nChr 20 -win 50000 -step 10000  -outnames theta.thetasWindow.gz


Step1: Estimating the SFS in Teosinte (Zea mays ssp. parviglumis)

First estimate the site allele frequency likelihood. This requires several things listed below:

1. A file with listed, one per line, all the .bam files you want to analyse. You can grab the files you need by cd’ing to the dir they are in and executing this code on the command line.

ls $PWD/*.bam > bam.list


2. Choose which method you want to use with:

-doSaf [int 1-4]

There are four options listed in detail here.


3. Define your ancestral allele using the flag:

-anc <path/to/referencegenome>

In my case we do not know the ancestral allele state, which means instead of derived allele SFS we need a minor allele SFS (a folded SFS). We can still provide an ancestral estimate using the reference genome (B73), but once folding is complete in becomes a minor allele SFS. We need to specify that we want a folded SFS wiht:

-fold 1


4. Define the method for estimating Genotype Likelihoods:

-GL [int 1-4]

details of the different methods are provided here.


5. Define the number of processors to use with:

-P [int]


6. Define the outfile name using:

-out <path/to/outfile>

there are several output files and a suffix will be added to each file.


You can also define the region of the genome you would like to analyse. On the first pass we will limit our analysis of the genome to the first 500,000 bp of chromosome 10 using:

-r 10:1-100000

I have a bam.list file that has the following accessions listed:

/home/sbyfield/HapMap2Teo/TIL01_merged.bam
/home/sbyfield/HapMap2Teo/TIL02_merged.bam
/home/sbyfield/HapMap2Teo/TIL03_merged.bam
/home/sbyfield/HapMap2Teo/TIL04-TIP454_merged.bam
/home/sbyfield/HapMap2Teo/TIL05_merged.bam
/home/sbyfield/HapMap2Teo/TIL06-TIP496_merged.bam
/home/sbyfield/HapMap2Teo/TIL07_merged.bam
/home/sbyfield/HapMap2Teo/TIL09_merged.bam
/home/sbyfield/HapMap2Teo/TIL10_merged.bam
/home/sbyfield/HapMap2Teo/TIL11_merged.bam
/home/sbyfield/HapMap2Teo/TIL12_merged.bam
/home/sbyfield/HapMap2Teo/TIL14-TIP498_merged.bam
/home/sbyfield/HapMap2Teo/TIL15_merged.bam
/home/sbyfield/HapMap2Teo/TIL16_merged.bam
/home/sbyfield/HapMap2Teo/TIL17_merged.bam

So I can run the following command to get the initial estimation of site allele frequency likelihood (SAF):

angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 12 -r 10:1-10000000 

I extedned the region from 1:100,000 bp to 1:500,000 bp resulted in the follwoing error:

        -> Reading fasta: /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa
        -> Parsing 15 number of samples
        -> Printing at chr: 10 pos:5698 chunknumber 100
        -> Printing at chr: 10 pos:7154 chunknumber 200
        -> Printing at chr: 10 pos:26725 chunknumber 300
        -> Printing at chr: 10 pos:29944 chunknumber 400
        -> Printing at chr: 10 pos:41538 chunknumber 500
        -> Printing at chr: 10 pos:44283 chunknumber 600
        -> Printing at chr: 10 pos:46373 chunknumber 700
        -> Printing at chr: 10 pos:51852 chunknumber 800
        -> Printing at chr: 10 pos:51952 chunknumber 900
PROBS at: 10    52032
        -> Printing at chr: 10 pos:70089 chunknumber 1000
        -> Printing at chr: 10 pos:70974 chunknumber 1100
        -> Printing at chr: 10 pos:123765 chunknumber 1200
PROBS at: 10    143508
        -> Printing at chr: 10 pos:143560 chunknumber 1300
        -> Printing at chr: 10 pos:144692 chunknumber 1400
PROBS at: 10    146774
        -> Printing at chr: 10 pos:147193 chunknumber 1500
        -> Printing at chr: 10 pos:176348 chunknumber 1600
        -> Printing at chr: 10 pos:176962 chunknumber 1700
        -> Printing at chr: 10 pos:182529 chunknumber 1800
        -> Printing at chr: 10 pos:207307 chunknumber 1900
        -> Printing at chr: 10 pos:266486 chunknumber 2000
        -> Printing at chr: 10 pos:292877 chunknumber 2100
        -> Printing at chr: 10 pos:317113 chunknumber 2200
        -> Printing at chr: 10 pos:372899 chunknumber 2300
        -> Printing at chr: 10 pos:443060 chunknumber 2400
        -> Printing at chr: 10 pos:475084 chunknumber 2500

        -> Done reading data waiting for calculations to finish
        -> Done waiting for threads
        -> Output filenames:
                ->"smallFolded_10_100000.arg"
                ->"smallFolded_10_100000.saf"
                ->"smallFolded_10_100000.saf.pos.gz"
        -> Tue Oct 28 15:07:25 2014
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  43.92 sec
        [ALL done] walltime used =  25.00 sec

real    0m0.000s
user    0m0.000s
sys     0m0.000s

I examined extending the region to 1:1,000,000 bp with:

angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 12 -r 10:1-1000000 -minMapQ 1 -minQ 20

and got this error in the stderr log file:

        -> Reading fasta: /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa
        -> Parsing 15 number of samples
        -> Printing at chr: 10 pos:5698 chunknumber 100
        -> Printing at chr: 10 pos:7154 chunknumber 200
        -> Printing at chr: 10 pos:26725 chunknumber 300
        -> Printing at chr: 10 pos:29944 chunknumber 400
        -> Printing at chr: 10 pos:41538 chunknumber 500
        -> Printing at chr: 10 pos:44283 chunknumber 600
        -> Printing at chr: 10 pos:46373 chunknumber 700
        -> Printing at chr: 10 pos:51852 chunknumber 800
        -> Printing at chr: 10 pos:51952 chunknumber 900
PROBS at: 10    52032
        -> Printing at chr: 10 pos:70089 chunknumber 1000
        -> Printing at chr: 10 pos:70974 chunknumber 1100
        -> Printing at chr: 10 pos:123765 chunknumber 1200
PROBS at: 10    143508
        -> Printing at chr: 10 pos:143560 chunknumber 1300
        -> Printing at chr: 10 pos:144692 chunknumber 1400
PROBS at: 10    146774
        -> Printing at chr: 10 pos:147193 chunknumber 1500
        -> Printing at chr: 10 pos:176348 chunknumber 1600
        -> Printing at chr: 10 pos:176962 chunknumber 1700
        -> Printing at chr: 10 pos:182529 chunknumber 1800
        -> Printing at chr: 10 pos:207307 chunknumber 1900
        -> Printing at chr: 10 pos:266486 chunknumber 2000
        -> Printing at chr: 10 pos:292877 chunknumber 2100
        -> Printing at chr: 10 pos:317113 chunknumber 2200
        -> Printing at chr: 10 pos:372899 chunknumber 2300
        -> Printing at chr: 10 pos:443060 chunknumber 2400
        -> Printing at chr: 10 pos:475084 chunknumber 2500
        -> Printing at chr: 10 pos:537668 chunknumber 2600
        -> Printing at chr: 10 pos:549090 chunknumber 2700
        -> Printing at chr: 10 pos:574173 chunknumber 2800
        -> Printing at chr: 10 pos:589414 chunknumber 2900
        -> Printing at chr: 10 pos:615307 chunknumber 3000
        -> Printing at chr: 10 pos:633719 chunknumber 3100
        -> Printing at chr: 10 pos:651276 chunknumber 3200
        -> Printing at chr: 10 pos:667980 chunknumber 3300
        -> Printing at chr: 10 pos:687278 chunknumber 3400
        -> Printing at chr: 10 pos:703154 chunknumber 3500
        -> Printing at chr: 10 pos:730360 chunknumber 3600
        -> Printing at chr: 10 pos:750631 chunknumber 3700
        -> Printing at chr: 10 pos:770970 chunknumber 3800
        -> Printing at chr: 10 pos:789476 chunknumber 3900
        -> Printing at chr: 10 pos:808868 chunknumber 4000
        -> Printing at chr: 10 pos:829136 chunknumber 4100
        -> Printing at chr: 10 pos:861912 chunknumber 4200
        -> Printing at chr: 10 pos:888113 chunknumber 4300
        -> Printing at chr: 10 pos:911758 chunknumber 4400
        -> Printing at chr: 10 pos:935262 chunknumber 4500
        -> Printing at chr: 10 pos:958633 chunknumber 4600
        -> Printing at chr: 10 pos:974523 chunknumber 4700

        -> Done reading data waiting for calculations to finish
        -> Done waiting for threads
        -> Output filenames:
                ->"smallFolded_10_100000.arg"
                ->"smallFolded_10_100000.saf"
                ->"smallFolded_10_100000.saf.pos.gz"
        -> Tue Oct 28 15:16:32 2014
        -> Arguments and parameters for all analysis are located in .arg file
        [ALL done] cpu-time used =  82.42 sec
        [ALL done] walltime used =  38.00 sec

real    0m0.000s
user    0m0.000s

I extedned the region to 10,000,000 bp. Next it is probably wise to go ahead an see if I can move through the rest of the analyses and actually producce a folded SFS.

The next step is to convert the .saf file to a SFS using:

realSFS

There there is some confusion in the wiki for ANGSD about exactly how to do this, sometimes it says you need to declare \(2n+1\) chromosomes, sometimes \(2n\) chromosomes and sometimes \(n\), where \(n\) is the number of samples. I have found that using \(n\) gets ANGSD to run. For example, if I have 15 samples (as I do in this example) I will declare 15 chromosomes as an argument to realSFS:

realSFS teoFolded_chr10.saf 15 -maxIter 100 -P 12 > teoFolded_chr10.sfs

Here is the resulting SFS for the first 10,000,000 bp of chromosome 10 for 15 teosinte lines:

sfs<-c(-0.192947, -2.444628, -3.249441, -4.179364, -4.586453, -5.137299 ,-5.489250, -5.845022, -6.020604, -6.241799, -6.441396, -6.546019, -6.785617, -6.850590, -6.869085 ,-7.022943)
barplot(exp(sfs[-1]),col="darkgrey", names.arg=1:(length(sfs)-1), ylab="probability",xlab="minor allele freqquency", main = "Chr10:1:10,000,000")

This worked well, so I attempted the whole thing again for the entire Chr10 using:

>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -r 10 -P 12 -minMapQ 1 -minQ 20
>realSFS teoFolded_chr10.saf 15 -maxIter 100 -P 12 > teoFolded_chr10.sfs
sfs<-c(-0.206044, -2.398034, -3.210098, -4.072419, -4.525486, -5.026914, -5.395742, -5.723438, -5.943467, -6.157773, -6.327133, -6.439034, -6.542996, -6.667280, -6.892473 -7.044611)
barplot(exp(sfs[-1]),col="darkgrey", names.arg=1:(length(sfs)-1), ylab="probability",xlab="minor allele freqquency", main = "Chr10")

This worked well also so I extended it to the whole genome using:

>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 1 -minQ 20
>realSFS teoFolded.saf 15 -maxIter 100 -P 18 > teoFolded.sfs

Step2: Calculate the thetas for each site

The next step is too calculate site wise estimates of theta for the region of interest. This can be acheived with a command like this:

>angsd -bam bam.filelist -out out -doThetas 1 -doSaf 1 -pest out.sfs -anc chimpHg19.fa -GL 2

So for jsut the Chr10 data my command looks like:

>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoThetas_ch10 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000 -minMapQ 1 -minQ 20

So to take a look at the data you can do:

gunzip -c teoThetas_ch10.thetas.gz | head

The ouput looks like this..

#Chromo  Pos    Watterson   Pairwise    thetaSingleton  thetaH  thetaL
10  19  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  20  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  21  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  22  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  23  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  24  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  25  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  26  -3.057582   -3.592928   -Inf    -Inf    -Inf
10  27  -3.057582   -3.592928   -Inf    -Inf    -Inf

Note that the last three columns cannot be calculated with a folded spectrum. It’s not important for calculating Tajima’s D in this case. The next step is to create a BED file of the genome region.

#create a binary version of thete.thetas.gz 
misc/thetaStat make_bed teoThetas_ch10.thetas.gz
#calculate Tajimas D
./misc/thetaStat do_stat out.thetas.gz -nChr 20 -win 5000 -step 1000  -outnames teothetasWindow_chr10.gz

The outout of the call looks like this:

#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)  Chr  WinCenter   tW  tP  tF  tH  tL  Tajima  fuf fud fayh    zeng    nSites
(981,5553)(1000,6000)(1000,6000)    10  3500    229.075532  128.679660  0.000000    0.000000    0.000000    -1.951619   1.022695    2.198914    0.627387    -1.271722   4572
(1744,6553)(2000,7000)(2000,7000)   10  4500    252.158774  146.216524  0.000000    0.000000    0.000000    -1.871340   1.056264    2.200253    0.647711    -1.272027   4809
(2737,7373)(3000,9633)(3000,8000)   10  5500    245.769776  142.231370  0.000000    0.000000    0.000000    -1.876311   1.054038    2.199907    0.646415    -1.271948   4636
(3737,7373)(4000,9633)(4000,9000)   10  6500    198.005228  116.321264  0.000000    0.000000    0.000000    -1.836315   1.068549    2.196625    0.655984    -1.271199   3636
(4553,7740)(5000,10000)(5000,10000) 10  7500    160.708626  92.637354   0.000000    0.000000    0.000000    -1.884167   1.046818    2.192724    0.643425    -1.270307   3187
(5553,8515)(6000,12805)(6000,11000) 10  8500    135.715894  77.997890   0.000000    0.000000    0.000000    -1.890558   1.042092    2.188930    0.641280    -1.269437   2962
(6553,8515)(7000,12805)(7000,12000) 10  9500    84.359327   46.535357   0.000000    0.000000    0.000000    -1.988064   0.994260    2.174266    0.614661    -1.266046   1962
(7373,8695)(9633,13000)(8000,13000) 10  10500   62.033024   37.406809   0.000000    0.000000    0.000000    -1.755972   1.080761    2.160589    0.671020    -1.262847   1322

Here is description of the file format form the ANGSD wiki:

"The .pestPG file is a 14 column file (tab seperated). The first column contains information about the region. The second and third column is the reference name and the center of the window.
We then have 5 different estimators of theta, these are: Watterson, pairwise, FuLi, fayH, L. And we have 5 different neutrality test statistics: Tajima's D, Fu&Li F's, Fu&Li's D, Fay's H, Zeng's E. The final column is the effetive number of sites with data in the window."

Make an attempt to draw estimates of Tajima’s D across the whole of Chr10:

#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teothetasWindow_chr10.gz.pestPG", header = T, sep = "\t")
plot(TJD[,3],TJD[,9], pch = 16, cex=.5)

Now examine Tajima’s D with proportion of bases covered per window.

library(ggplot2)
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teothetasWindow_chr10.gz.pestPG", header = T, sep = "\t")
plot(TJD[,14]/5000,TJD[,9], pch = 16, cex=.5)

#make a data table to plot with ggplot
TajimaD<-data.frame("pos"=TJD[,3],"prop"=TJD[,3],"TD"=TJD[,9])
plt <- ggplot(TajimaD, aes(x= prop,y=TD))+ 
  geom_point()+
    stat_summary(geom="ribbon", fun.ymin="mean", fun.ymax="mean")
plt

I spoke with Jeff and he indicated that the parameters are probably too liberal, resultingin bad mapping and an over-estimation of rare alleles (hence negative TD) . For example, he tells me that simulations indicated that BWA-MEM maps better alignments with ninimum aligment score of 40. So, I will re-run the analysis with modified parameters detailed below:

#the minimum number of indivduals with at least 1 read.
-minInd [int]
#The m
-minMapQ 20 

-minQ 40

I will re-run analyses with the first 10,000,000 bp of chromosome 10 using the following command:

>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_10000000 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 12 -r 10:1-1000000 -minMapQ 20 -minQ 40 

>realSFS teoFolded_chr10_10000000.saf 15 -maxIter 100 -P 12 > teoFolded_chr10_10000000.sfs

using the following slurm batch file:

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
set -e
set -u

echo "Starting job.."
echo ""

#for the sfs defining, in this case, a false ancestral allele (i.e. the B73 genome)
echo "angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_ch10_10000000 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 20 -minQ 40 -r 10:1-10000000"
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_ch10_10000000 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 20 -minQ 40 -r 10:1-10000000
#make it into a folded SFS, or minor allele spectrum
echo "realSFS teoFolded_ch10_10000000.saf 15 -maxIter 100 -P 18 > teoFolded_ch10_10000000.sfs"
realSFS teoFolded_ch10_10000000.saf 15 -maxIter 100 -P 18 > teoFolded_ch10_10000000.sfs
echo ""
time
echo ""
echo "done"

the JOB ID is: 1480928

Following this I used ANGSD to estimate theta:

angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoThetas_ch10_10000000 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10_100000.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000

Using this file:


#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u

echo "Starting job.."
time
echo ""

angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoThetas_ch10_10000000 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_ch10_10000000.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000 -minMapQ 20 -minQ 40

echo ""
time
echo ""
echo "done"

ID = 1483211

Then I can estimate TajD over sliding windows using:

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u

echo ""
time
echo "making BED"

thetaStat make_bed teoThetas_ch10_10000000.thetas.gz
thetaStat do_stat teoThetas_ch10_10000000.thetas.gz -nChr 15 -win 5000 -step 1000  -outnames teothetasWindow_chr10_10000000.gz

echo "done"
time

The output look like this:

## thetaStat VERSION: 0.01 build:(Oct 23 2014,15:39:05)
#(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)  Chr  WinCenter   tW  tP  tF  tH  tL  Tajima  fuf fud fayh    zeng    nSites
(71,388)(300,1530)(300,800) 10  550 10.460646   6.107603    0.000000    0.000000    0.000000    -1.762926   0.951768    1.945277    0.634197    -1.207356   317
(137,388)(400,1530)(400,900)    10  650 8.275646    4.823003    0.000000    0.000000    0.000000    -1.744697   0.924989    1.889293    0.628344    -1.191207   251
(237,388)(500,1530)(500,1000)   10  750 4.888378    2.850494    0.000000    0.000000    0.000000    -1.673753   0.854698    1.733225    0.613838    -1.141940   151
(296,388)(611,1530)(600,1100)   10  850 3.259762    2.020438    0.000000    0.000000    0.000000    -1.458366   0.836464    1.586624    0.634632    -1.089353   92
(346,388)(700,1530)(700,1200)   10  950 1.506088    0.941821    0.000000    0.000000    0.000000    -1.258182   0.681794    1.267787    0.587341    -0.950147   42
(388,388)(1530,1530)(800,1300)  10  1050    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    -nan    -nan    -nan    -nan    0
(388,388)(1530,1530)(900,1400)  10  1150    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    -nan    -nan    -nan    -nan    0
(388,388)(1530,1530)(1000,1500) 10  1250    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    -nan    -nan    -nan    -nan    0
library(ggplot2)
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teothetasWindow_chr10_10000000.gz.pestPG", header = T, sep = "\t")
plot(TJD[,14]/5000,TJD[,9], pch = 16, cex=.5, xlab="proportion of bases covered", ylab = "Tajima's D")

plot(TJD[,3]/(10^6),TJD[,9], pch = 16, cex=.5, xlab = "position (Mbp)",ylab = "Tajima's D")

#make a data table to plot with ggplot
#TajimaD<-data.frame("pos"=TJD[,3],"prop"=TJD[,3],"TD"=TJD[,9])
#plt <- ggplot(TajimaD, aes(x= prop,y=TD))+ 
#  geom_point()+
#    stat_summary(geom="ribbon", fun.ymin="mean", fun.ymax="mean")
#plt

The mapping quality seems to be an issue and also I need to limit the number of sites to only those where -minInd == 12 of the 15 samples. Note that:

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18
set -e
set -u

echo "Starting job.."
echo ""

#for the sfs defining, in this case, a false ancestral allele (i.e. the B73 genome)
echo "angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_10M_minInd12 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-10000000 -fold 1 -P 18 -minMapQ 40 -minQ 40 -minInd 12"
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_10M_minInd12 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-10000000 -fold 1 -P 18 -minMapQ 40 -minQ 40 -minInd 12
#make it into a folded SFS, or minor allele spectrum
echo "realSFS teoFolded_chr10_10M_minInd12.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_10M_minInd12.sfs"
realSFS teoFolded_chr10_10M_minInd12.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_10M_minInd12.sfs
echo ""
time
echo ""
echo "done"

Next step is to calculate thetas over the genomic region of interest using:

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u

echo "Starting job.."
time
echo ""

>angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded_chr10_10M_minInd12.sfs -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10_10M_minInd12.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-10000000 -minMapQ 40 -minQ 40 minInd 12

echo ""
time
echo ""
echo "done"

Now to scan the genome by window, so we can plot the data using:

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=12
set -e
set -u

echo ""
time
echo "making BED"

thetaStat make_bed teoFolded_chr10_10M_minInd12.sfs.thetas.gz
thetaStat do_stat teoFolded_chr10_10M_minInd12.sfs.thetas.gz -nChr 15 -win 5000 -step 1000  -outnames teoFolded_chr10_10M_minInd12

echo "done"
time

Jeff notes that these line (from HapMap2) are all selfed for many generations and so need the expected heterozygosity will change. These samples are Inbred. You can indicate this to ANGSD with an inbreeding file* containing the appropriate inbreeding coefficient for each sample. In this case 1.

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18

#angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doSaf 1 -out teoFolded_chr10_100M_minInd15 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-100000000 -fold 1 -P 18 -minMapQ 20 -minQ 40 -minInd 15 -indF InbreedingCoEff.txt
realSFS teoFolded_chr10_100M_minInd15.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_100M_minInd15.sfs 
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded_chr10_100M_minInd15 -doThetas 1 -doSaf 1 -fold 1 -pest teoFolded_chr10_100M_minInd15.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-100000000 -minMapQ 30 -minQ 40 minInd 15 -indF InbreedingCoEff.txt
thetaStat make_bed teoFolded_chr10_100M_minInd15.sfs.thetas.gz
thetaStat do_stat teoFolded_chr10_100M_minInd15.sfs.thetas.gz -nChr 15 -win 5000 -step 1000  -outnames teoFolded_chr10_100M_minInd15 

echo "done"
time
library(ggplot2)
#load in the data
TJD<-read.table("/Users/simonrenny-byfield/test_angst/teoFolded_chr10_100M_minInd15.pestPG", header = T, sep = "\t")
plot(TJD[,14]/5000,TJD[,9], pch = 16, cex=.5, xlab="proportion of bases covered", ylab = "Tajima's D")

scatter.smooth(x=TJD[,14]/5000, y=TJD[,9], cex = 0.01,col = "red", lwd = 3, lty = 3, span = 0.3, family = "gaussian")

plot(TJD[,14]/5000,(TJD[,5]/TJD[,14]), pch = 16, cex=.5, xlab="proportion of bases covered", ylab = "pi")

scatter.smooth(x=TJD[,14]/5000, y=(TJD[,5]/TJD[,14]), cex = 0.01,col = "red", lwd = 3, lty = 3, span = 0.3, family = "gaussian")

plot(TJD[,3]/(10^6),TJD[,9], pch = 16, cex=.5, xlab = "position (Mbp)",ylab = "Tajima's D")

hist(TJD[,9][TJD[,9] != 0], col = "cornflowerblue", xlab="Tajima's D")

#make a data table to plot with ggplot
#TajimaD<-data.frame("pos"=TJD[,3],"prop"=TJD[,3],"TD"=TJD[,9])
#plt <- ggplot(TajimaD, aes(x= prop,y=TD))+ 
#  geom_point()+
#    stat_summary(geom="ribbon", fun.ymin="mean", fun.ymax="mean")
#plt

Retry the analysis with a few parameter changes:

most importantly I was changed the -doSaf paramter to 2 (-doSaf 2), which DOES NOT assume HWE like the -doSaf 1 option DOES.

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18

#angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doMaf 1 -uniqueOnly 0 -doMajorMinor 1 -doSaf 2 -out teoFolded_chr10_100M_minInd15 -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -r 10:1-100000000 -fold 1 -P 18 -minMapQ 30 -minQ 20 -minInd 15 -indF InbreedingCoEff.txt
#realSFS teoFolded_chr10_100M_minInd15.saf 15 -maxIter 100 -P 18 > teoFolded_chr10_100M_minInd15.sfs
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded_chr10_100M_minInd15.sfs -doThetas 1 -doSaf 2 -fold 1 -pest teoFolded_chr10_100M_minInd15.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 12 -r 10:1-100000000 -minMapQ 30 -minQ 20 -indF InbreedingCoEff.txt -doMajorMinor 1 -doMaf 1
thetaStat make_bed teoFolded_chr10_100M_minInd15.sfs.thetas.gz
thetaStat do_stat teoFolded_chr10_100M_minInd15.sfs.thetas.gz -nChr 15 -win 5000 -step 1000  -outnames teoFolded_chr10_100M_minInd15

echo "done"
time

Full Genome Comparisons

I removed the minimum coverage (of individuals, i.e. minInd). Once the analysis is run we can then parse the data according to how many individuals are represented for any given position. I set antother job running for the whole genome using this command (in a slurm batch file):

#!/bin/bash
#OUTDIR=/home/sbyfield/HapMap2Teo
#SBATCH -D /home/sbyfield/HapMap2Teo/
#SBATCH -o /home/sbyfield/HapMap2Teo/slurm-log/stdout_log-%j.txt
#SBATCH -e /home/sbyfield/HapMap2Teo/slurm-log/stderr_log-%j.txt
#SBATCH -J teo
#SBATCH --cpus-per-task=18

echo "Starting jog:"
date

angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -doMaf 1 -uniqueOnly 0 -doMajorMinor 1 -doSaf 2 -out teoFolded -anc /home/sbyfield/HapMap2Teo/Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -fold 1 -P 18 -minMapQ 30 -minQ 20 -indF InbreedingCoEff.txt
realSFS teoFolded.saf 15 -maxIter 100 -P 18 > teoFolded.sfs
angsd -bam /home/sbyfield/HapMap2Teo/teo.bam.file.list.txt -out teoFolded.sfs -doThetas 1 -doSaf 2 -fold 1 -pest teoFolded.sfs -anc Zea_mays.AGPv3.22.dna.genome.fa -GL 2 -P 18 -minMapQ 30 -minQ 20 -indF InbreedingCoEff.txt -doMajorMinor 1 -doMaf 1
thetaStat make_bed teoFolded.sfs.thetas.gz
thetaStat do_stat teoFolded.sfs.thetas.gz -nChr 15 -win 5000 -step 1000  -outnames teoFolded.thetasWindow -P 18

echo "done"
date